!pip install folium
!pip install lxml
Annie Lu
CMSC320
Final Tutorial Project
Published December 15th, 2018
Last updated TBA, 2018
Despite being a developed country, the United States has a considerable amount of income inequality. Since this is an American issue, I want to explore (1) where income inequality resides, using maps, and (2) the various factors which could affect income. Since the dataset is large, I will look at this at a regional level, the East Coast, rather than a national level; specifically, DC and the nearby states of Maryland, Virginia, Pennsylvania, Delaware, and New Jersey. Another objective of this tutorial is to get familiar with the American Community Survey (ACS), because it is a reliable, widely applicable source on the U.S. population.
Quite a few libraries are used in this tutorial. Two fundamentals for data science are pandas and numpy. Pandas provides flexible data structures, series and dataframes, along with useful functions to manipulate them. Numpy is a powerful mathematical library which performs fast calculations, since it is implemented in C.
import pandas as pd
import numpy as np
Libraries for Data Collection:
from bs4 import BeautifulSoup
import requests
Libraries for Data Analysis and Data Visualization:
import matplotlib.pyplot as plt
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
import seaborn as sns
import folium
import branca
I'll also be using a tool called IPUMS which helps create a dataset of selected variables from one or multiple samples, such as the American Community Survey (ACS).
My data analysis/visualizations will, later on, rely on maps, so it is necessary to collect geographical and location data. For a proper map with boundaries, I searched for json files which plot either state or county boundaries as points. However, it was not enough for a file to simply contain the boundaries; it also needed feature identification or properties, like state abbreviations and FIPS codes. The json files I ended up finding for the map required corresponding state abbreviations and FIPS codes.
Some datasets did not have postal codes or FIPS codes, so I found a table online with both and parsed it using the BeautifulSoup library.
r = requests.get("https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696")
state_code = BeautifulSoup(r.text, "html.parser")
state_code = state_code.find(id="detail")
state_code = state_code.prettify()  # readable html table format
state_code = pd.read_html(state_code)  # returns a list of dataframes; only index [0] exists
state_code = state_code[0]
state_code.columns = state_code.loc[0]  # first row holds the column names
state_code = state_code.drop(0, axis=0)
state_code.head()
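As a sanity check that pd.read_html behaves as described, it can be run on a small inline table, a stand-in for the scraped USDA page (the column names here are illustrative):

```python
import io
import pandas as pd

# Minimal inline HTML table mimicking the structure of the scraped page.
html = """
<table id="detail">
  <tr><th>Name</th><th>Postal Code</th><th>FIPS</th></tr>
  <tr><td>Maryland</td><td>MD</td><td>24</td></tr>
  <tr><td>Virginia</td><td>VA</td><td>51</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html))  # one DataFrame per <table> element
df = tables[0]
print(df)
```

The <th> cells are picked up as the header row automatically, which is why the real scrape above has to promote the first row to column names manually (that table uses plain <td> cells).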
Population data was necessary to understand why some counties were covered and not others: most often, the densely populated/largest counties are the ones with identified FIPS codes.
county_population = pd.read_csv("county_population.csv", sep=',')
county_population = county_population[['STATE', 'COUNTY', 'STNAME', 'CTYNAME', 'POPESTIMATE2017']]
IDFIP = []
PERCENTAGE = []
for index, row in county_population.iterrows():
    if row['COUNTY'] == 0:  # county code 0 marks the state-total row
        total_pop = row['POPESTIMATE2017']
        PERCENTAGE.append(1)
    else:
        # county population as a share of its state's population
        PERCENTAGE.append(row['POPESTIMATE2017'] / total_pop)
    # 5-digit FIPS = 2-digit state code + 3-digit county code, zero-padded
    fip = str(row['STATE']).zfill(2) + str(row['COUNTY']).zfill(3)
    IDFIP.append(fip)
county_population['IDFIP'] = IDFIP
county_population['PERCENTAGE'] = PERCENTAGE
county_population.head()
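The FIPS zero-padding done row by row above can also be written without a Python loop; a sketch on a toy frame (values are illustrative, not from the census file):

```python
import pandas as pd

# Toy frame mimicking the STATE/COUNTY integer columns.
df = pd.DataFrame({'STATE': [24, 24, 1], 'COUNTY': [0, 5, 73]})

# Vectorized padding: 2-digit state code + 3-digit county code.
df['IDFIP'] = df['STATE'].astype(str).str.zfill(2) + df['COUNTY'].astype(str).str.zfill(3)
print(df['IDFIP'].tolist())  # ['24000', '24005', '01073']
```

The `.str.zfill` accessor pads each string with leading zeros, which avoids iterrows and scales better on the full county table.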
To analyze income, we need data about U.S. people and households. A staple dataset for this is the American Community Survey (ACS). A brief overview: this survey from the U.S. Census Bureau collects data from American households on a monthly basis. Many variables are recorded, including race, gender, age, income, household structure, etc.
There are various ways to collect ACS data, ranging in difficulty and flexibility. In general, from least to most difficult/flexible, they are: Census Reporter, American FactFinder, and IPUMS. In this tutorial, I will be using IPUMS. This tool lets us custom-select variables and samples before downloading the file. The file can be formatted multiple ways, but I chose csv, or "comma-separated values".
The dataset has a massive number of rows, yet still only covers 1% of the population. In general, for large groups like the entire US and individual states, this is fine for drawing conclusions, since the margin of error is insignificant at those levels. To prevent my computer from crashing and burning, I took a subset of this dataset and focused on a state/regional level.
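Another way to keep a wide extract manageable (not used below, just a sketch) is to read only the columns you need; here with a toy csv standing in for the real IPUMS file:

```python
import io
import pandas as pd

# Stand-in for the IPUMS csv; the real file has many more columns.
csv = "YEAR,STATEFIP,FTOTINC,EXTRA\n2017,24,60000,x\n2017,51,45000,y\n"

# usecols keeps only the listed columns, cutting memory for wide files.
us = pd.read_csv(io.StringIO(csv), usecols=['STATEFIP', 'FTOTINC'])
print(us.shape)  # (2, 2)
```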
us = pd.read_csv("usa_00006.csv", sep=',')
# subset by state FIPS code
maryland = us.loc[us['STATEFIP'] == 24]
dc = us.loc[us['STATEFIP'] == 11]
virginia = us.loc[us['STATEFIP'] == 51]
penn = us.loc[us['STATEFIP'] == 42]
delaware = us.loc[us['STATEFIP'] == 10]
jersey = us.loc[us['STATEFIP'] == 34]
data = pd.concat([maryland, dc, virginia, penn, delaware, jersey])
data.head()
The ACS accounts for both households and people. Since multiple people exist within a household, there was some duplicate information, so I collapsed each household into a single row. I also decided to drop default/missing values.
There are multiple ways to define income and income inequality. One way is by taking the total household income and dividing it by the square root of the number of people in the household. The square root is to account for shared resources and equivalent consumer expenditures.
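As a quick check of the formula: a four-person household with $100,000 of total income gets an adjusted income of 100000 / sqrt(4) = 50,000.

```python
import numpy as np

total_income = 100_000
household_size = 4
# Divide by sqrt(n) to reflect resources shared within the household.
adjusted = total_income / np.sqrt(household_size)
print(adjusted)  # 50000.0
```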
def adjust_income(d):
    # drop households whose family income is coded as missing/N.A.
    d = d[d.FTOTINC != 9999999]
    d = d[d.FTOTINC != 9999998]
    # keep one row per household; SERIAL identifies a household
    household = d.drop_duplicates(subset='SERIAL', keep='first')
    adjusted_hhincome = []
    for serial, df_household in d.groupby('SERIAL'):
        # equivalence-adjusted income: total family income / sqrt(household size)
        adjusted_hhincome.append(df_household['FTOTINC'].iloc[0] / np.sqrt(len(df_household)))
    household['Adjusted_Income'] = adjusted_hhincome
    household = household.drop(columns=['PERWT', 'PERNUM', 'YEAR', 'DATANUM', 'SERIAL', 'CBSERIAL'])
    return household
east_coast=adjust_income(data)
east_coast.head()
Next I calculated income statistics for a given demographic. The ACS dataset has weighted households, meaning some households represent more of the population than others, so I had to take that into account when finding medians/percentile cut-offs.
Taking the median income, rather than the average income, can give a more accurate representation of the general population. The median is robust, unlike the average, so the statistic will not be heavily affected by outliers, like extremely wealthy people.
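A toy example of this robustness (illustrative numbers, not ACS values):

```python
import numpy as np

incomes = [40_000, 45_000, 50_000, 55_000, 10_000_000]  # one extreme earner
print(np.mean(incomes))    # 2038000.0 -- dragged up by the outlier
print(np.median(incomes))  # 50000.0   -- unmoved
```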
There are many ways of calculating income inequality. A straightforward approach is the 90/10 decile ratio: divide the income at the 90th percentile (the cutoff for the top 10%) by the income at the 10th percentile (the cutoff for the bottom 10%). When processing weighted statistics, numpy functions like cumulative and total sums make the bookkeeping easier.
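The weighted-percentile idea can be checked on toy data: sort by income, take the running sum of the weights, and pick the first income whose cumulative weight crosses the target fraction of the total. A sketch with hypothetical weights:

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({'income': [30_000, 50_000, 80_000],
                  'weight': [1, 3, 1]}).sort_values('income')
cumsum = np.cumsum(d['weight'])   # running weight: 1, 4, 5
cutoff = d['weight'].sum() / 2.0  # half the total weight: 2.5
# weighted median: first income whose cumulative weight reaches the cutoff
median = d['income'][cumsum >= cutoff].iloc[0]
print(median)  # 50000
```

The middle household carries weight 3, so the weighted median lands on its income even though it is only one of three rows.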
def income_stat(d):
    # weighted percentiles: sort by income, then walk the running sum of household weights
    d = d.sort_values('Adjusted_Income')
    cumsum = np.cumsum(d['HHWT'])
    cutoff = np.sum(d['HHWT']) / 2.0  # half the total weight -> weighted median
    ninety = np.sum(d['HHWT']) * 0.9  # 90th percentile cutoff
    ten = np.sum(d['HHWT']) * 0.1     # 10th percentile cutoff
    median = d.Adjusted_Income[cumsum >= cutoff].iloc[0]
    num = d.Adjusted_Income[cumsum >= ninety].iloc[0]
    denom = d.Adjusted_Income[cumsum >= ten].iloc[0]
    income_inequality = num / denom   # 90/10 decile ratio
    return [d['STATEFIP'].iloc[0], d['COUNTYFIPS'].iloc[0], median, num, denom, income_inequality]
print(income_stat(east_coast))
columns = ['State', 'County', 'Median', 'Ninetieth', 'Tenth', '90/10 Ratio']
rows = []
for state in [maryland, dc, virginia, penn, delaware, jersey]:
    state = adjust_income(state)
    rows.append(income_stat(state))
income_stats = pd.DataFrame(rows, columns=columns)
income_stats
def counties_stat(d):
    county_fips = d['COUNTYFIPS'].unique()
    rows = []
    for county_fip in county_fips:
        x = d.loc[d['COUNTYFIPS'] == county_fip]
        x = adjust_income(x)
        rows.append(income_stat(x))
    return pd.DataFrame(rows, columns=['State', 'County', 'Median', 'Ninetieth', 'Tenth', '90/10 Ratio'])
counties_map_data = pd.concat([counties_stat(s) for s in
                               [maryland, virginia, dc, penn, delaware, jersey]])
counties_map_data.head()
Postal_Code = []
for index, row in income_stats.iterrows():
    x = str(row['State'])
    match = state_code.loc[state_code['FIPS'] == x]['Postal Code'].values
    # fall back to a blank when the FIPS is not in the lookup table
    Postal_Code.append(match[0] if len(match) > 0 else ' ')
income_stats["Postal Code"] = Postal_Code
income_stats.head()
state_geo='us-states.json'
m = folium.Map(location=[40, -75], zoom_start=6)
style = {'fillColor': '#00000000', 'color': '#00000000'}
folium.Choropleth(
    geo_data=state_geo,
    name='choropleth',
    data=income_stats,
    columns=['Postal Code', 'Median'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.9,
    legend_name='Median Income',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#000000',
             'color': 'green',
             'weight': 3}
folium.GeoJson(state_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['name'], aliases=['State:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
state_geo='us-states.json'
m = folium.Map(location=[40, -75], zoom_start=6)
style = {'fillColor': '#00000000', 'color': '#00000000'}
folium.Choropleth(
    geo_data=state_geo,
    name='choropleth',
    data=income_stats,
    columns=['Postal Code', '90/10 Ratio'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.9,
    legend_name='Income Inequality',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#000000',
             'color': 'green',
             'weight': 3}
folium.GeoJson(state_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['name'], aliases=['State:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
IDFIP = []
for index, row in counties_map_data.iterrows():
    # GEOID = state FIPS + zero-padded 3-digit county FIPS
    IDFIP.append(str(row['State']) + str(row['County']).zfill(3))
counties_map_data["IDFIP"] = IDFIP
counties_map_data.head()
state_geo='cb_2017_us_county_20m.json'
m = folium.Map(location=[40, -75], zoom_start=6)
folium.Choropleth(
    geo_data=state_geo,
    name='average income',
    data=counties_map_data,
    columns=['IDFIP', 'Median'],
    key_on='feature.properties.GEOID',
    fill_color='YlGn',
    fill_opacity=0.9,
    line_opacity=1,
    legend_name='Median Income',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#ffaf00',
             'color': 'green',
             'weight': 3}
folium.GeoJson(state_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
state_geo='cb_2017_us_county_20m.json'
m = folium.Map(location=[40, -75], zoom_start=6)
folium.Choropleth(
    geo_data=state_geo,
    name='average income',
    data=counties_map_data,
    columns=['IDFIP', '90/10 Ratio'],
    key_on='feature.properties.GEOID',
    fill_color='YlGn',
    fill_opacity=0.9,
    line_opacity=1,
    legend_name='Income Inequality',
    highlight=True
).add_to(m)
highlight = {'fillColor': '#ffaf00',
             'color': 'green',
             'weight': 3}
folium.GeoJson(state_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.LayerControl().add_to(m)
m
county_geo='cb_2017_us_county_20m.json'
m = folium.Map(location=[40, -75], zoom_start=6)
highlight = {'fillColor': '#ffaf00',
             'color': 'green',
             'weight': 3}
folium.GeoJson(county_geo, name='tooltip',
               tooltip=folium.features.GeoJsonTooltip(fields=['NAME'], aliases=['County:']),
               highlight_function=lambda a: highlight,
               style_function=lambda x: style).add_to(m)
folium.Choropleth(
    geo_data=county_geo,
    name='population',
    data=county_population,
    columns=['IDFIP', 'PERCENTAGE'],
    key_on='feature.properties.GEOID',
    fill_color='YlGn',
    fill_opacity=1,
    line_opacity=0.9,
    legend_name='Population',
    highlight=True
).add_to(m)
folium.LayerControl().add_to(m)
m
avoid = data['INCTOT'].isin([9999999, 9999998])  # missing/N.A. income codes are integers
data = data[~avoid]
county=data.groupby(['STATEFIP','COUNTYFIPS'])
state=data.groupby(['STATEFIP'])
state=state.mean()
county=county.mean()
avoid = data['FTOTINC'].isin([9999999, 9999998])
data = data[~avoid]
avoid = data['INCWAGE'].isin([9999999, 9999998, 0])
data = data[~avoid]
print(data.median())
data_sex=data.groupby(['SEX'])
print(data_sex.mean())
data_race=data.groupby(['RACE'])
print(data_race.median())
data_race=data.groupby(['SEX','RACE'])
print(data_race.median())
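To make the shape of these grouped medians concrete, here is a toy frame with made-up values (the SEX codes mirror the ACS convention of integer codes, but the incomes are invented):

```python
import pandas as pd

toy = pd.DataFrame({'SEX': [1, 1, 2, 2],
                    'INCWAGE': [40_000, 60_000, 35_000, 55_000]})
# groupby + median returns one row per group, indexed by the group key
print(toy.groupby('SEX')['INCWAGE'].median())
```

The real calls above work the same way, just with many more rows and with multi-level indexes when grouping by both SEX and RACE.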